Abstract

This paper describes RESOLVE, a system that
uses decision trees to learn how to classify
coreferent phrases in the domain of business joint
ventures. An experiment is presented in which
the performance of RESOLVE is compared to the
performance of a manually engineered set of
rules for the same task. The results show that
decision trees achieve higher performance than
the rules in two of three evaluation metrics
developed for the coreference task. In addition
to achieving better performance than the rules,
RESOLVE provides a framework that facilitates
the exploration of the types of knowledge that
are useful for solving the coreference problem.

1 Introduction

The goal of an Information Extraction (IE) system is to
identify information of interest from a collection of texts.
Within a particular text, objects of interest are often
referenced in different places and in different ways. One of
the many challenges facing an IE system is to determine
which references refer to which objects. This problem
can be recast as a classification problem: given two
references, do they refer to the same object or different
objects?

The Message Understanding Conferences (MUCs)
[Sundheim, 1991; Sundheim, 1992; Sundheim, 1993] and
the Tipster Project [Merchant, 1993] helped both to
define the information extraction task and to push the
technology of IE systems. Each of these evaluation efforts
provided a corpus of news articles about a domain, a
specification of the relevant information that was to be
extracted from each article, the output representation
of that information, and a set of key templates
representing the information extracted from each article by
human readers.^1 For the final evaluations, participating
systems were given a set of blind texts and their output
was scored against the key templates to determine
how much of the relevant information they were able to
extract.

The sentence analyzers used in many of these systems
have shown significant improvement over the past several
years. However, the discourse processing capabilities of
these systems, particularly their coreference resolution
components, have often been cited as weak areas [Weir
and Fritzson, 1993; Moldovan et al., 1992; Aberdeen et
al., 1992].

poration and the National Center for Automated Information
Research. 



The IE systems developed at UMass [Lehnert et al., 1991;
Lehnert et al., 1992; Lehnert et al., 1993] also displayed
weak coreference resolution capabilities. Each of
these systems used a set of manually engineered rules 
to resolve some obvious types of coreference, but they 
tended to be very conservative, i.e., they only consid- 
ered phrases to be coreferent if there was overwhelm- 
ing evidence in support of that hypothesis. One of the 
problems with these coreference resolution components 
was figuring out which features of the phrases to look at 
when determining coreference. Another, related set of 
problems was determining how to combine positive and 
negative evidence into individual rules and then how to 
order the rule set. A third problem area was the accumu- 
lation of errors at that late stage of processing, e.g., from 
incorrectly delimited sentences, incorrect part-of-speech 
tags, and other sentence analysis errors. 

In an effort to address these problems, a new approach 
to coreference resolution was begun after the MUC-5 
evaluation: a system named RESOLVE was created to
build decision trees that can be used to classify pairs of 
phrases as coreferent or not coreferent. The errors gener- 
ated by the sentence analyzer were eliminated by using 
a special tool - the Coreference Marking Interface, or 
CMI - to extract a set of phrases from the MUC-5 English
Joint Venture (EJV) corpus.^2 In order to minimize
the difficulties involved with creating and maintaining
complex sets of rules, a machine learning approach was
adopted, in which a decision tree determines the order
and relative weight of different pieces of evidence.

1 The MUC-5 evaluation actually included 4 different
domains, but most participants were required to select only one.

2 The MUC-5 EJV corpus is a collection of news articles,
written in English, that describe business joint ventures, i.e.,



RESOLVE used the C4.5 decision tree system [Quinlan,
1993] to learn how to classify coreferent phrases for the
experiments reported in this paper. C4.5 was chosen
primarily due to its ease of use and its widespread
acceptance; however, RESOLVE can use any learning system
that uses feature vectors composed of attribute-value
pairs.

2 Decision Trees vs. Rules 

An experiment was conducted to compare the performance
of the decision trees generated by RESOLVE with
the performance of manually engineered rules used for
coreference classification in the UMass/Hughes MUC-5
IE system. A set of references, along with the coreference
links among them, was extracted from a group of texts
via CMI. All possible pairings of references from each text
were generated, and these pairings were used to create
a set of feature vectors used by RESOLVE. The pairings
that contained coreferent phrases formed positive
instances, while those that contained two non-coreferent
phrases formed negative instances. RESOLVE was then
iteratively trained and tested on different partitions of
this set of feature vectors.
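
The pairing step described above can be sketched as follows. This is a minimal illustration, not the RESOLVE implementation: the `(phrase, entity_id)` representation and the function name `make_instances` are assumptions made for the example, with `entity_id` standing in for the coreference links marked via CMI, and the third reference is invented.

```python
from itertools import combinations


def make_instances(references):
    """Pair every two references from one text and label each pair.

    `references` is a list of (phrase, entity_id) tuples. The entity_id
    field is a hypothetical stand-in for the coreference links marked
    via CMI: two phrases are coreferent when their ids are equal.
    """
    instances = []
    for (phrase_a, ent_a), (phrase_b, ent_b) in combinations(references, 2):
        label = "+" if ent_a == ent_b else "-"  # positive vs. negative instance
        instances.append((phrase_a, phrase_b, label))
    return instances


# Hypothetical references for a single text; the third entry is invented
# for illustration and is not taken from the corpus.
refs = [
    ("FAMILYMART CO.", 1),
    ("TAIWAN'S LARGEST CAR DEALER", 2),
    ("FAMILYMART", 1),
]
instances = make_instances(refs)
```

With n references in a text this produces n(n-1)/2 pairs, which is why negative instances tend to outnumber positive ones.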

The data structure used in discourse processing by the
UMass/Hughes MUC-5 IE system was the memory token,
which converted the case frame output from the
CIRCUS sentence analyzer [Lehnert, 1991] into a more
system-independent representation. Prior to coreference
processing, each memory token contained one noun
phrase, one or more lexical patterns encompassing that 
phrase, part-of-speech tags, semantic features, and infor- 
mation that was inferred from either the phrase or the 
context in which the phrase was found. This inferred 
information included the type of object referenced by 
the phrase, any name or location substring contained in 
the phrase, and some domain-specific information such 
as whether the phrase was a joint venture parent (one of 
the entities who formed a joint venture) or joint venture 
child (the joint venture company itself). The references 
marked via CMI were converted into a memory token 
representation in order to test the performance of the 
MUC-5 system's coreference module. 

2.1 Data 

The articles in the EJV corpus describe business joint 
ventures among two or more entities (companies, govern- 
ments and/or people). The task definition provided for 
MUC-5 required IE systems to extract information about
the entities involved, the relationships among these
entities, the facilities associated with the joint venture, the
products or services offered by the joint venture, its
capitalization and revenue projections, and a variety of other
related information. Since the entities involved in these
joint ventures were the main focus of most of these
articles, references to entities were much more numerous
than references to other types of object classes, e.g.,
people. Therefore, entity references were selected as the
focus of the experiments reported in this paper.

2 (continued) associations of two or more entities (companies,
governments or people) created for the purpose of owning
and/or developing a project together.

CMI is a graphical user interface that permits the user
to mark phrases in a text; for each phrase, the user can 
indicate the object(s) with which the phrase is coreferent 
and some additional information about the phrase that 
can be inferred either from the phrase itself or its local 
context. This additional information is parameterized 
and can be modified easily for use in different domains. 
The data used in this experiment was based on a set of 
phrases extracted using CMI. 

As an example, consider the following sentence, from 
text 0970 from the MUC-5 EJV corpus: 

FAMILYMART CO. OF SEIBU SAISON GROUP WILL OPEN
A CONVENIENCE STORE IN TAIPEI FRIDAY IN A JOINT VENTURE
WITH TAIWAN'S LARGEST CAR DEALER, THE COMPANY SAID WEDNESDAY.

The phrases underlined in this sentence contain relevant
information that must be extracted by an IE system.^3
The phrases in boldface refer to entity objects
that are important to the MUC-5 task. As an example 
of the types of information collected about each phrase, 
consider the first phrase in the sentence: 

(:string "FAMILYMART CO."
 :slots (ENTITY
          (name "FAMILYMART CO.")
          (type COMPANY)
          (relationship JV-PARENT CHILD)))

Information collected about each phrase includes the 
string itself, the character position of the string in the 
source text (not shown), the index of the sentence within 
which the string is found (also not shown) , and some slot 
information that can be inferred from either the string 
itself or its local context - the same kind of informa- 
tion that was contained in the memory tokens used by 
the MUC-5 system. In this example, the name of the 
entity and the fact that it is a company entity can both 
be inferred from the string itself. The fact that
Familymart Company plans to open a store in "A JOINT
VENTURE" with another entity is considered adequate
evidence that the company is the parent of a joint
venture (JV-PARENT); the fact that the sentence contains
the pattern "company-name-1 OF company-name-2" is
evidence that company-name-1, in this case Familymart
Co., is a subsidiary (CHILD) of company-name-2, in this
case Seibu Saison Group.

3 Note that the phrase "THE COMPANY" in the last
clause of the sentence is not considered relevant, since it
contributes no information required for the MUC-5 task - the
determination of who is announcing a joint venture or when
the announcement was made are not relevant pieces of
information. Therefore, this phrase was not marked for use in the
experiment.

A second example of output from CMI can be seen be- 
low, where nationality information has been extracted 
from the reference to the car dealer: 

(:string "TAIWAN'S LARGEST CAR DEALER"
 :slots (ENTITY
          (type COMPANY)
          (relationship JV-PARENT)
          (nationality "Taiwan (COUNTRY)")))
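
For readers who want to work with such records programmatically, the two CMI outputs shown above could be mirrored in a small record type. This is only a sketch: the class name `EntityPhrase` and its field names are assumptions chosen for this example, and the character-position and sentence-index fields that the text says are also collected are omitted, as they are in the examples.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EntityPhrase:
    """One entity phrase marked via CMI (field names are illustrative)."""
    string: str
    name: Optional[str] = None            # name substring, if any
    entity_type: Optional[str] = None     # e.g. COMPANY
    relationship: Tuple[str, ...] = ()    # e.g. JV-PARENT, CHILD
    nationality: Optional[str] = None


# The two examples from the text, transcribed slot by slot.
familymart = EntityPhrase(
    string="FAMILYMART CO.",
    name="FAMILYMART CO.",
    entity_type="COMPANY",
    relationship=("JV-PARENT", "CHILD"),
)
car_dealer = EntityPhrase(
    string="TAIWAN'S LARGEST CAR DEALER",
    entity_type="COMPANY",
    relationship=("JV-PARENT",),
    nationality="Taiwan (COUNTRY)",
)
```

Slots that could not be inferred for a phrase, such as the name of the car dealer, are simply left unset.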



IF both tokens come from the same trigger family
   THEN they are not coreferent.
IF each token comes from a different partition
   THEN they are not coreferent.
IF both tokens contain a common phrase
   THEN they are coreferent.
IF both tokens refer to joint ventures
   THEN they are coreferent.
IF both tokens contain the same company name
   THEN they are coreferent.
IF one token contains an alias of the other
   THEN they are coreferent.
IF only one token refers to a joint venture
   THEN they are not coreferent.
IF each token contains different company names
   THEN they are not coreferent.

Table 1: The MUC-5 system's coreference rules.
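
An ordered rule set of this kind can be read as a cascade in which the first rule that fires decides the outcome. The sketch below is a hedged illustration of that reading, not the MUC-5 code: the dictionary keys (`trigger_family`, `partition`, `noun_phrases`, `is_jv`, `name`) and the substring-based `is_alias` helper are assumptions introduced here, and the fall-through default of "not coreferent" is taken from the system's stated policy of minimizing false positives.

```python
def is_alias(name_a, name_b):
    """Hypothetical alias test: one name is a proper substring of the other."""
    if not name_a or not name_b or name_a == name_b:
        return False
    return name_a in name_b or name_b in name_a


def classify_pair(a, b):
    """Apply the Table 1 rules in order; the first rule that fires decides."""
    name_a, name_b = a.get("name"), b.get("name")
    same_family = (a["trigger_family"] is not None
                   and a["trigger_family"] == b["trigger_family"])
    rules = [  # (condition, verdict) in the order given in Table 1
        (same_family, False),
        (a["partition"] != b["partition"], False),
        (bool(a["noun_phrases"] & b["noun_phrases"]), True),
        (a["is_jv"] and b["is_jv"], True),
        (name_a is not None and name_a == name_b, True),
        (is_alias(name_a, name_b), True),
        (a["is_jv"] != b["is_jv"], False),
        (name_a is not None and name_b is not None and name_a != name_b, False),
    ]
    for fired, coreferent in rules:
        if fired:
            return coreferent
    return False  # no rule fired: default to "not coreferent"


# Hypothetical tokens; the field values are invented for illustration.
corp = {"trigger_family": 1, "partition": 0, "is_jv": False,
        "noun_phrases": {"SUMITOMO CORP."}, "name": "SUMITOMO CORP."}
short = {"trigger_family": 2, "partition": 0, "is_jv": False,
         "noun_phrases": {"SUMITOMO"}, "name": "SUMITOMO"}
venture = {"trigger_family": 3, "partition": 0, "is_jv": True,
           "noun_phrases": {"THE JOINT VENTURE"}, "name": None}
```

Because the first matching rule wins, moving a rule up or down the list changes the classification of any pair that satisfies more than one antecedent.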



In principle, much of the information gathered about 
a particular string could be found automatically: there 
are numerous proper name recognizer programs, pro- 
grams that extract location information, and sentence 
analyzers that can infer relationship information - any 
system that exhibited good performance in MUC-5 must 
be good at inferring such relationships. 

For the purposes of our experiment, however, this in- 
formation was specified by a user via CMI. The primary 
motivation for this was to minimize the noise in the data; 
coreference resolution often occurs at a late processing 
stage in an IE system, and earlier errors such as incor- 
rect part-of-speech tags, incorrectly delimited sentences 
and semantic tagging errors can create significant noise 
for a coreference classifier. 

CMI was used to mark references to a variety of
relevant object types (entity, facility, person and 
product-or-service) in 50 randomly selected texts. 
Since references to entity objects were most numerous, 
this was the object class chosen for the experiment. In 
the 50 texts, 472 references to a total of 205 entity ob- 
jects were marked using CMI. 

Some phrases are multireferent, i.e., they refer to more 
than one object. These multireferent phrases pose diffi- 
culties for classification, since it means that some phrases 
will be coreferent with other phrases in the text that 
have distinct referents. Thus for a set of phrase pairs 
which share a given phrase, more than one pair would 
be classified as a positive instance of coreference. Further 
complications are created for evaluating the performance 
of a coreference system when multireferent phrases are
included in the data (see Section 2.4). To simplify the
initial experiments reported here, multireferent phrases
were excluded from the data set. The capability to handle
such phrases will be incorporated in a later version
of RESOLVE.



2.2 Rules used in the MUC-5 System 

The coreference module of the UMass/Hughes MUC-5 
IE system was designed to minimize false positives, i.e., 
minimize the likelihood that two phrases that were not 
coreferent would be labeled coreferent. This design de- 
cision was based on the assumption that false posi- 
tive errors, resulting in the merging of non-coreferent 
phrases in the final system output, would harm sys- 
tem performance more than false negative errors, which 
would result in coreferent phrases showing up in distinct
objects in the system output. This rather conservative
approach to coreference was shared by a number
of MUC system developers [Appelt et al., 199;
Ayuso et al., 1992], though not all [Iwanska et al., 1992].

4 In order to make things manageable for the CMI annotator,
the size of the texts was limited to 2KB; however, the majority
of texts in the EJV domain fall into this category.



Another factor influencing the coreference module was 
the short time allotted to developing and testing this sys- 
tem component. Since coreference resolution was a late 
stage in processing, upstream components had to be sta- 
bilized before serious development could take place on 
coreference. Several late-stage components were being 
developed in parallel, so it is difficult to assess the time 
devoted exclusively to developing the coreference module,
but we estimate it was two person-weeks.

The rules used to determine whether two phrases 
(represented as memory tokens) were coreferent in the 
MUC-5 system are shown in Table 1. Following the policy
of minimizing false positives, whenever none of the
rules fired, the system classified the pair of tokens as not 
coreferent. 

The UMass/Hughes MUC-5 IE system used a variety of
mechanisms to identify phrases referring to joint
ventures (the entity formed by two or more parent
entities for some particular business purpose), to identify
company names within a phrase (if they exist), and to
determine whether one phrase was an alias (an abbreviation
or shortened form) of another, as well as to identify
trigger families^5 and partitions^6 in the text.

Individual Phrases        Pair of Phrases
Attribute     Value       Attribute        Value
NAME-1        YES         ALIAS            NO
JV-CHILD-1    NO          BOTH-JV-CHILD    NO
NAME-2        YES         COMMON-NP        NO
JV-CHILD-2    NO          SAME-SENTENCE    NO

Table 2: Attributes and Values for EJV entity instance.


One of the many difficulties in developing the rule set 
for coreference classification was in ordering the rules. 
Several different orderings were tested during the de- 
velopment period, and the order shown above was the 
ordering of the rule set used for final evaluation. This 
difficulty in rule ordering was one of the motivations be- 
hind using a machine learning approach - we wanted to 
develop a system that could learn how to combine the 
positive and negative evidence. 

2.3 Features Used By RESOLVE 

A decision tree requires data to be represented by feature
vectors, i.e., vectors of attribute/value pairs. For the
task of coreference classification, references were paired
up, and features were extracted from the pair of
references as well as from the individual references
themselves. Since this experiment involved a comparison
between RESOLVE and a manually engineered rule set, the
features used in this experiment were based on the
antecedents of the coreference rules used in the
UMass/Hughes MUC-5 IE system.

For example, Table 2 shows a feature vector that represents
the pairing of the phrases "FAMILYMART CO."
and "TAIWAN'S LARGEST CAR DEALER" . Since the 
two phrases are not coreferent, this represents a negative 
instance. 

Of the 8 features used in this experiment, two focus on 
the first reference, two focus on the second reference and 
four are based on the pair of references. The following is 
a brief description of the features that focus on individual 
phrases, where i ∈ {1,2}.

• NAME-i: Does reference i contain a name? Possible 
values: {yes, no}. 

• JV-CHILD-i: Does reference i refer to a joint venture 
child, i.e., a company formed as the result of a tie- 
up among two or more entities? Possible values: 
{yes, no, unknown}. 

5 A trigger family is a set of phrases all triggered off the
same word, e.g., a subject and direct object joined by the
same verb phrase.

6 A partition is a portion of the text that is focusing on the
same main topic. For the MUC-5 system, distinct partitions
were recognized only for texts that had bulleted items, as one
might see in a news summary of the day's headlines. Most
texts thus had a single partition.

The last four features focus on the pair of references.

• ALIAS: Does one reference contain an alias of the
other, i.e., does each reference contain a name and is
one name a substring of the other name?^7 Possible
values: {yes, no}.

• BOTH-JV-CHILD: Do both references refer to a joint
venture child? This feature is defined as

yes when ∀i, JV-CHILD-i = yes
no when ∃i, JV-CHILD-i = no
unknown otherwise.

• COMMON-NP: Do the references share a common
noun phrase? Some references contain non-simple
noun phrases, e.g., appositions and relative clauses.
This feature compares the simple constituent noun
phrases of each reference. Possible values: {yes,
no}.

• SAME-SENTENCE: Do the references come from the
same sentence? RESOLVE does not use CIRCUS output,
and thus has no notion of a trigger family as it was
used in the MUC-5 system; the SAME-SENTENCE feature
is a very weak attempt to extract this sort of
information. Possible values: {yes, no}.
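
The eight features described above can be computed by a small function such as the one sketched here. The input format (dictionaries with `name`, `jv_child`, `simple_nps` and `sentence` keys) is an assumption made for illustration, and the second condition of BOTH-JV-CHILD is read as "at least one JV-CHILD-i is no", which is an interpretation of the definition rather than something the text states outright.

```python
def yes_no(flag):
    return "yes" if flag else "no"


def both_jv_child(v1, v2):
    """yes if both are yes; no if at least one is no (assumed reading);
    unknown otherwise."""
    if v1 == "yes" and v2 == "yes":
        return "yes"
    if v1 == "no" or v2 == "no":
        return "no"
    return "unknown"


def extract_features(ref1, ref2):
    """Build the 8-attribute feature vector for a pair of references."""
    name1, name2 = ref1.get("name"), ref2.get("name")
    # ALIAS: both references contain a name and one is a substring of the other.
    alias = (name1 is not None and name2 is not None
             and (name1 in name2 or name2 in name1))
    common_np = bool(set(ref1["simple_nps"]) & set(ref2["simple_nps"]))
    return {
        "NAME-1": yes_no(name1 is not None),
        "JV-CHILD-1": ref1["jv_child"],
        "NAME-2": yes_no(name2 is not None),
        "JV-CHILD-2": ref2["jv_child"],
        "ALIAS": yes_no(alias),
        "BOTH-JV-CHILD": both_jv_child(ref1["jv_child"], ref2["jv_child"]),
        "COMMON-NP": yes_no(common_np),
        "SAME-SENTENCE": yes_no(ref1["sentence"] == ref2["sentence"]),
    }


# Hypothetical references: the names appear in the paper, while the
# sentence indices and JV-CHILD values are invented for this example.
ref_a = {"name": "SUMITOMO CORP.", "jv_child": "no",
         "simple_nps": ["SUMITOMO CORP."], "sentence": 0}
ref_b = {"name": "SUMITOMO", "jv_child": "unknown",
         "simple_nps": ["SUMITOMO"], "sentence": 2}
vector = extract_features(ref_a, ref_b)
```

Any learner that accepts attribute-value vectors could consume the resulting dictionary once the class label is attached.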

1230 feature vectors, or instances, were created from 
the entity references marked in the 50 texts. Of 
these, 322 (26%) were positive ("+") instances - pairs 
of phrases that were coreferent - and 908 (74%) were 
negative ( "-" ) instances - pairs of phrases that were not 
coreferent. Figure 1 shows a pruned C4.5 decision tree
trained on all the instances. 

2.4 Evaluation Methodology 

Coreference is a symmetrical and transitive relation that
holds among a set of two or more references, e.g., if we
know that A is coreferent with B, and B is coreferent
with C, then there is an implicit coreference "link"
between A and C.^8 Any coreference classification for two
references has implications beyond the determination of
whether that particular classification was correct or
incorrect. For example, if A and B are correctly classified
as coreferent, but B and C are incorrectly classified as
not coreferent, a system may also incorrectly conclude
that A and C are not coreferent. Thus, simply measuring
the accuracy of a coreference classifier is inadequate
for evaluating how well the classifier performs its task.

Figure 1: A pruned C4.5 decision tree

7 Note that some texts contain more than one entity for
which a given name might be an alias under this definition,
e.g., "SUMITOMO" is a substring of both "SUMITOMO
CORP." and "SUMITOMO ELECTRICAL INDUSTRIES
LTD.", so this feature is not always a reliable indicator of
coreference.

8 As was noted earlier, some references are multireferent,
i.e., they have more than one referent. Thus, if B is
multireferent, we cannot conclude that A is coreferent with C; for
example, if A = Sneezy, B = the dwarfs and C = Grumpy,
we don't want to infer that Sneezy = Grumpy. We can
ignore such complications in this paper since the experiments
reported herein exclude multireferent phrases.

Two metrics that have been used to evaluate the
performance of IE systems are recall and precision
[Chinchor, 1991; Chinchor, 1992; Chinchor and Sundheim,
1993]. Recall is the percentage of information in a text
that is correctly extracted by a system; precision is the
percentage of information extracted by a system that is
correct. For example, if a text contains four relevant
items (represented by {A, B, C, D} in an answer key),
and a system correctly extracts the three items {A, B,
C} but incorrectly extracts the two additional items {E,
F} (represented by {A, B, C, E, F} in a system
response), then its recall would be 75% and its precision
would be 60%.
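
The worked example above can be checked with a few lines of set arithmetic. The function name `recall_precision` is an assumption for this sketch; the key and response sets are the ones given in the text.

```python
def recall_precision(key, response):
    """Recall and precision of a response set against an answer key."""
    correct = key & response                 # items extracted correctly
    recall = len(correct) / len(key)         # share of the key that was found
    precision = len(correct) / len(response) # share of the response that is right
    return recall, precision


key = {"A", "B", "C", "D"}
response = {"A", "B", "C", "E", "F"}
recall, precision = recall_precision(key, response)
```

Three of the four key items are found (recall 3/4) and three of the five extracted items are correct (precision 3/5).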

A function to combine recall and precision into a single
measure of performance was incorporated into the
Fourth Message Understanding Evaluation and Conference
[Chinchor, 199]. The F-measure, a metric used
to evaluate Information Retrieval (IR) system performance
[van Rijsbergen, 1979], combines recall and precision
scores into a single number using the formula

$$F = \frac{(\beta^2 + 1.0) \times P \times R}{\beta^2 \times P + R}$$

where P is the precision score, R is the recall score and β
is the relative weight given to recall over precision. For
example, a β value of 1.0 gives equal weight to recall and
precision; a value of 2.0 gives recall twice the weight of
precision; a value of 0.5 gives recall half the weight of
precision.
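
As a quick illustration of how β shifts the balance, the F-measure can be evaluated for the earlier worked example (precision 60%, recall 75%) at the three β values mentioned. The function name `f_measure` is an assumption; the formula is the one given above.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure with beta as the relative weight of recall over precision."""
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# Precision and recall from the earlier worked example.
p, r = 0.60, 0.75
f_equal = f_measure(p, r, beta=1.0)         # about 0.667
f_recall_heavy = f_measure(p, r, beta=2.0)  # about 0.714, pulled toward recall
f_prec_heavy = f_measure(p, r, beta=0.5)    # 0.625, pulled toward precision
```

Since recall is the higher of the two scores in this example, weighting recall more heavily raises F and weighting precision more heavily lowers it.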

System               Recall    Precision    F-measure
RESOLVE (unpruned)   85.4%     87.6%        86.5%
RESOLVE (pruned)     80.1%     92.4%        85.8%
MUC-5 rule set       67.7%     94.4%        78.9%

Table 3: Results for EJV entity coreference resolution.

An evaluation methodology for the coreference task is
being developed for the upcoming Sixth Message
Understanding Evaluation and Conference (MUC-6). The
metrics used for evaluating overall IE system performance
are being adapted for use on this subtask (cf. [Burger et
al., 1994]), where the answer key for each text contains
a set of phrases and the coreference links among them. 
However, evaluation of coreference performance is com- 
plicated by the need to take into account the implicit 
coreference links among phrases. Thus, transitive clo- 
sures are taken for both the answer key (the key closure) 
and the system response (the response closure). Recall is 
measured by the percentage of explicit coreference links 
in the key that are also found in the response closure, 
i.e., what fraction of correct coreference links is implied 
by the transitive closure of the coreference links in the 
system response. Precision is measured by the percent- 
age of explicit coreference links in the response that are 
also found in the key closure, i.e., what fraction of coref- 
erence links in the response is implied by the transitive 
closure of the coreference links in the key. 
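
The closure-based scoring just described can be sketched with a union-find structure, which groups phrases into the equivalence classes implied by a set of links. This is an illustrative reading of the paragraph above, not the MUC-6 scoring software: the function names are assumptions, links are taken to be pairs of phrase identifiers, and both link lists are assumed to be non-empty.

```python
def closure_classes(links):
    """Map each phrase to a representative of its transitive-closure class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)          # merge the two classes
    return {x: find(x) for x in list(parent)}


def implied(link, classes):
    """True if the link is contained in the closure described by `classes`."""
    a, b = link
    return a in classes and b in classes and classes[a] == classes[b]


def score(key_links, response_links):
    """Recall: explicit key links found in the response closure.
    Precision: explicit response links found in the key closure."""
    key_cl = closure_classes(key_links)
    resp_cl = closure_classes(response_links)
    recall = sum(implied(l, resp_cl) for l in key_links) / len(key_links)
    precision = (sum(implied(l, key_cl) for l in response_links)
                 / len(response_links))
    return recall, precision


key = [("A", "B"), ("B", "C")]
full_credit = score(key, [("A", "B"), ("A", "C")])   # B-C is implied
half_credit = score(key, [("A", "B"), ("C", "D")])   # B-C missed, C-D spurious
```

In the first response the link B-C is never stated explicitly, but it follows from A-B and A-C, so the response is not penalized for expressing the same class with different links.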

2.5 Results 

One experiment was run using RESOLVE. In this experiment,
the instances were grouped into 50 sets, one for each of
the 50 texts; in each iteration, one set was held out for
testing and the remaining sets were used to train a new
decision tree. This process was iterated over all 50 sets
of instances. The results shown in Table 3 represent the
average of these iterations: the first row shows the recall,
precision and F-measure (β = 1.0) scores for unpruned
decision trees; the second row shows the results for pruned
decision trees.^9

The third row in Table 3 shows the results from a
second experiment, in which the rule set from the
coreference module of the UMass/Hughes MUC-5 IE system
was applied to the memory token pairs generated from
the references marked using CMI.
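
The train-and-test procedure used for the decision trees amounts to leave-one-text-out cross-validation, which can be sketched as below. The learner here is a trivial majority-class stand-in, not C4.5, and the function names, data layout and toy instances are all assumptions made for illustration.

```python
from collections import Counter


def leave_one_text_out(instances_by_text, train, classify):
    """Hold out each text's instances in turn; train on all the others."""
    results = []
    for held_out in instances_by_text:
        training = [inst
                    for text, insts in instances_by_text.items()
                    if text != held_out
                    for inst in insts]
        model = train(training)
        predictions = [(classify(model, features), label)
                       for features, label in instances_by_text[held_out]]
        results.append((held_out, predictions))
    return results


def train_majority(training):
    """Stand-in learner: remember the most frequent label (not C4.5)."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def classify_majority(model, features):
    return model


# Invented toy data: three texts with (features, label) instances.
data = {
    "t1": [({"ALIAS": "yes"}, "+"), ({"ALIAS": "no"}, "-")],
    "t2": [({"ALIAS": "no"}, "-")],
    "t3": [({"ALIAS": "no"}, "-"), ({"ALIAS": "no"}, "-"),
           ({"ALIAS": "yes"}, "+")],
}
folds = leave_one_text_out(data, train_majority, classify_majority)
```

Grouping by text keeps all pairs from one article on the same side of the train/test split, so a tree is never tested on pairs from a text it was trained on.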

2.6 Discussion 

When we first began applying decision trees to the coref- 
erence resolution problem, we were hoping to achieve 
performance that was comparable to the manually engi- 
neered rules we had used in MUC-5. We were greatly en- 
couraged to discover that we could achieve performance 
that surpassed the performance of the rules from our 
MUC-5 system in both recall and F-measure scores. 



9 Default settings for all C4.5 parameters were used
throughout this experiment (see [Quinlan, 1993], Chapter 9,
for more information about C4.5 parameters).



System               β = 2.0    β = 1.0    β = 0.5
RESOLVE (unpruned)   85.8%      86.5%      87.1%
RESOLVE (pruned)     82.3%      85.8%      89.6%
MUC-5 rule set       71.8%      78.9%      87.5%

Table 4: F-measures for different values of β.



As was noted earlier, the MUC-5 coreference rules 
were designed to minimize false positives. The effect 
of this bias can be seen in the higher precision score 
achieved by the rule set in comparison with both the 
unpruned and pruned decision trees. The difference in 
precision scores between the unpruned and pruned ver- 
sions of the decision trees might be explained by the 
prevalence of negative instances (74%) in the data set, 
which may lead to a stronger bias to classify pairs of 
phrases as not coreferent in the smaller trees. 

The comparative effects of false positives and false 
negatives in coreference classification on overall IE system
performance remain an open question. However,
while the precision scores achieved by the decision trees 
and the rule-base are rather close, especially for the 
pruned version of the trees, there is a large difference 
between their recall scores. Until we can ascertain the 
relative importance of high recall vs. high precision in 
overall IE system performance, the F-measure score that 
gives equal weight to recall and precision may be the best 
indicator of overall performance on the coreference
resolution task. However, as can be seen in Table 4, when
RESOLVE uses pruning, its performance surpasses that of
the rule set even when the recall score is given twice the
weight of the precision score or when the recall score is
given half the weight of the precision score.^10

3 Conclusions 

One of the original goals of this new approach was to de- 
velop a system that achieved good performance in resolv- 
ing references - performance that was at least as good 
as the performance achieved using manually engineered 
rules in our MUC-5 system. However, as we continue 
to pursue this approach, we find that there is another 
advantage to using decision trees: they allow us to focus 
on determining which features work well for resolving 
references. 

We are encouraged by the performance of the decision 
trees on the coreference resolution problem. The fea- 
tures we have used in the experiment described above 
are not considered comprehensive by any means. While 
they have proved sufficient for attaining a certain level of 
performance, an examination of specific errors made by 
the trees shows that additional features will be needed 
to attain higher levels. 



10 The pruned decision trees yield higher F-measure scores
than the MUC-5 rule set unless the recall score is given less 
than one-third the weight of the precision score. 
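
The one-third threshold can be checked approximately from the published scores. The sketch below bisects for the β at which the pruned trees and the rule set have equal F-measures, using the rounded recall and precision values from Table 3, so the result is only an approximation; the function names are assumptions.

```python
def f_measure(precision, recall, beta):
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# (precision, recall) in percent, as reported in Table 3.
PRUNED = (92.4, 80.1)
RULES = (94.4, 67.7)


def crossover_beta(lo=0.01, hi=1.0, steps=60):
    """Bisect for the beta at which both systems have the same F-measure."""
    def diff(beta):
        return f_measure(*PRUNED, beta) - f_measure(*RULES, beta)

    for _ in range(steps):
        mid = (lo + hi) / 2
        if diff(mid) > 0:   # pruned trees still ahead: crossover lies below mid
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


beta_star = crossover_beta()
```

With these rounded inputs the crossover falls a little below β = 1/3, in line with the statement above.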



One area we will develop further is a set of features 
that incorporate syntactic knowledge. We don't have 
any features that identify the various syntactic con- 
stituents of a sentence, e.g., subject or direct object, nor 
do we have any features that identify clause boundaries 
(only sentence boundaries). These features will be
incorporated in future experiments. Features based on
focus of attention [Sidner, 1979; Grosz et al., 1982], which
presuppose knowledge about syntactic constituents, may
also prove useful. Our experiment used a feature set
that was largely semantic in nature: it is interesting to 
see how well semantic features work as a basis for coref- 
erence resolution ... and it is not surprising to see that 
they are also insufficient. 

Ultimately, we hope to understand better which fea- 
tures are important for coreference classification, across 
different objects and different domains. Such an under- 
standing would benefit people involved with IE system 
development, and should be of interest to people outside 
the IE community as well. We think that decision trees 
are an important tool in a systematic study of corefer- 
ence resolution. 